A study of n-gram and decision tree letter language modeling methods
Abstract
The goal of this paper is to investigate various language model smoothing techniques and decision tree based language model design algorithms. For this purpose, we build language models for printable characters (letters), based on the Brown corpus. We consider two classes of models for the text generation process: the n-gram language model and various decision tree based language models. In the first part of the paper, we compare the most popular smoothing algorithms applied to the former. We conclude that the bottom-up deleted interpolation algorithm performs best in the task of n-gram letter language model smoothing, significantly outperforming the back-off smoothing technique for large values of n. In the second part of the paper, we consider various decision tree development algorithms. Among them, a K-means clustering type algorithm for the design of the decision tree questions gives the best results. However, the n-gram language model outperforms the decision tree language models for letter language modeling. We believe that this is due to the predictive nature of letter strings, which seems to be naturally modeled by n-grams.
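To make the smoothing idea in the abstract concrete, here is a minimal Python sketch of an interpolated letter n-gram model. It is not the paper's bottom-up deleted interpolation algorithm: it assumes a single set of context-independent mixture weights, fitted by EM on held-out text, which is the simplest member of the interpolation family. The class name `InterpolatedLetterModel`, its methods, and the fallback to a uniform distribution for unseen contexts are all assumptions made for illustration.

```python
import math
from collections import Counter


class InterpolatedLetterModel:
    """Letter n-gram model smoothed by linear interpolation (illustrative sketch)."""

    def __init__(self, train_text, order):
        self.order = order
        self.alphabet = sorted(set(train_text))
        # ngram[k] counts k-letter strings; context[k] counts their (k-1)-letter prefixes.
        self.ngram = [Counter() for _ in range(order + 1)]
        self.context = [Counter() for _ in range(order + 1)]
        for k in range(1, order + 1):
            for i in range(len(train_text) - k + 1):
                gram = train_text[i:i + k]
                self.ngram[k][gram] += 1
                self.context[k][gram[:-1]] += 1
        # Component 0 is the uniform distribution; start from equal weights.
        self.weights = [1.0 / (order + 1)] * (order + 1)

    def components(self, history, letter):
        """Relative-frequency estimates of every order for one letter."""
        uniform = 1.0 / len(self.alphabet)
        probs = [uniform]
        for k in range(1, self.order + 1):
            ctx = history[-(k - 1):] if k > 1 else ""
            seen = self.context[k][ctx]
            # Assumption: an unseen context falls back to the uniform distribution.
            probs.append(self.ngram[k][ctx + letter] / seen if seen else uniform)
        return probs

    def prob(self, history, letter):
        """Interpolated probability of `letter` given the preceding letters."""
        return sum(w * p for w, p in zip(self.weights, self.components(history, letter)))

    def fit_weights(self, heldout, iterations=20):
        """EM re-estimation of the mixture weights on held-out text."""
        for _ in range(iterations):
            expected = [0.0] * len(self.weights)
            for i, letter in enumerate(heldout):
                history = heldout[max(0, i - self.order):i]
                parts = [w * p for w, p in zip(self.weights, self.components(history, letter))]
                total = sum(parts)
                for j, part in enumerate(parts):
                    expected[j] += part / total
            self.weights = [e / len(heldout) for e in expected]

    def bits_per_letter(self, text):
        """Cross-entropy of the model on `text`, in bits per letter."""
        log_sum = 0.0
        for i, letter in enumerate(text):
            history = text[max(0, i - self.order):i]
            log_sum += math.log2(self.prob(history, letter))
        return -log_sum / len(text)
```

A typical use would be to build the model from training text, call `fit_weights` on a separate held-out portion, and compare `bits_per_letter` on test text across values of `order`. Because the uniform component always keeps a positive weight, every letter in the training alphabet receives a nonzero probability.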
Similar resources
Syntactic Decision Tree LMs: Random Selection or Intelligent Design?
Decision trees have been applied to a variety of NLP tasks, including language modeling, for their ability to handle a variety of attributes and sparse context space. Moreover, forests (collections of decision trees) have been shown to substantially outperform individual decision trees. In this work, we investigate methods for combining trees in a forest, as well as methods for diversifying tre...
Decision Tree-Based Syntactic Language Modeling
Denis Filimonov, Doctor of Philosophy, 2011. Dissertation directed by Dr. Mary Harper, Department of Computer Science, and Dr. Philip Resnik, Department of Linguistics. Statistical Language Modeling is an integral part of many natural language processing applications, such as Automatic Speech Recognition (ASR) and Machine Translatio...
Generalized Interpolation in Decision Tree LM
In the face of sparsity, statistical models are often interpolated with lower order (backoff) models, particularly in Language Modeling. In this paper, we argue that there is a relation between the higher order and the backoff model that must be satisfied in order for the interpolation to be effective. We show that in n-gram models, the relation is trivially held, but in models that allow arbit...
Using Random Forests in the Structured Language Model
In this paper, we explore the use of Random Forests (RFs) in the structured language model (SLM), which uses rich syntactic information in predicting the next word based on words already seen. The goal in this work is to construct RFs by randomly growing Decision Trees (DTs) using syntactic information and investigate the performance of the SLM modeled by the RFs in automatic speech recognition...
Ranking stocks of listed companies on Tehran stock exchange using a hybrid model of decision tree and logistic regression
Much research has introduced linear or nonlinear models using statistical models and machine learning tools in artificial intelligence to estimate Iran's rate of return. The primary purpose of these methods is to use different independent variables simultaneously to improve the modeling of stock return rates. However, in predicting the rate of return, in addition to the modeling method, the degree of co...
Journal title:
- Speech Communication
Volume 24
Publication date: 1998